Udacity Data Science Nanodegree¶
Project 1: Writing a Data Scientist Blog Post¶
Notebook 4 of 4: London Airbnb Data¶
Business Question 4: Predict average daily listing price¶
by Juanita Smith¶
- Introduction
- Data gathering
- Feature Engineering
- Univariate Exploration
- Bivariate Exploration
- Multivariate Exploration
- Clustering
- Preprocessing
- Modelling Step 1 - Setting a baseline with Ridge and small alpha
- Modelling Step 2 - Model improvement with RidgeCV
- Final evaluation using TEST data
- Further evaluation using unseen Airbnb data with date stamp 06/09/2023
- Conclusions
- References
- Submission
In this notebook, we will use linear regression modelling to answer the last business question:
Business Question 4: Predict average daily listing price
Knowing the impact of business rentals, host excellence and star ratings from notebooks 2 and 3, can we build an accurate prediction of the average daily base price?
This notebook uses the cleaned files created by running notebooks 1 and 2 in sequence. Refer to the README file for data download instructions and metadata descriptions.
As the project is predicting a continuous variable, we will make use of linear regression algorithms. To use linear regression, the assumptions below need to be tested to build a model that will generalize well.
Linear Regression Assumptions:
- Target variable is normally distributed
- There should be a linear relationship between the target and the features
- No Multicollinearity - Independent features should have no relationship between them
- Normality of the Error Terms - residuals should be normally distributed with a mean of zero
- Homoscedasticity - the spread of the residuals should be consistent along the regression line.
# avoid memory leaks
import os
os.environ["OMP_NUM_THREADS"] = '1'
# import all packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# clear the garbage to free memory as we are working with huge datasets
import gc
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, PredictionErrorDisplay, median_absolute_error, \
mean_absolute_error
from sklearn.preprocessing import StandardScaler
import scipy as sp
from patsy import dmatrices
from scipy import stats
from scipy.special import boxcox, inv_boxcox
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import TransformedTargetRegressor, ColumnTransformer
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import cross_validate, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from statsmodels.stats.diagnostic import normal_ad
from sklearn.mixture import GaussianMixture
from sklearn.feature_selection import mutual_info_regression
from pickle import dump
# Import custom modules
import project1_data_science_blog.src.visuals as vs
# set plots to be embedded inline
%matplotlib inline
# suppress matplotlib user warnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib")
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")
# use high resolution if this project is run on an Apple device
%config InlineBackend.figure_format='retina'
# Make your Jupyter Notebook wider
from IPython.display import display, HTML
display(HTML('<style>.container { width:80% !important; }</style>'))
# environment settings
# display all columns and rows during visual inspection
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999
# geospatial analysis
import geopandas as gpd
from shapely.geometry import Point
from pyproj import CRS
gpd.options.display_precision = 15
# set default plot formatting and colors
BASE_COLOR, BASE_HIGHLIGHT_INTENSE, BASE_HIGHLIGHT, BASE_COMPLIMENTARY, BASE_GREY, BASE_COLOR_SUB1, BASE_COLOR_SUB2, SYMBOLS = vs.set_plot_defaults()
# default file paths
# time stamp the model was developed with
date_stamp = '060623'
# latest time stamp we can use to validate the model, as an optional last step towards the end
date_stamp_latest = '060923'
FILE_PATH_CLEAN = '../data/clean/'
FILE_NAME_ENHANCED = FILE_PATH_CLEAN + 'listing_enhanced_' + date_stamp + '.pkl'
FILE_NAME_ENHANCED_LATEST = FILE_PATH_CLEAN + 'listing_enhanced_' + date_stamp_latest + '.pkl'
random_state = 77
def log_transform(x):
    return np.log10(x)


def inverse_log_transform(x):
    return sp.special.exp10(x)
def goodness_of_fit(y, y_hat, show=False):
    """
    Calculates and returns the performance scores between true and predicted values.
    """
    # calculate the performance scores once and reuse them below
    r2 = r2_score(y, y_hat)
    mse = mean_squared_error(y, y_hat)
    mae = mean_absolute_error(y, y_hat)
    med_ae = median_absolute_error(y, y_hat)
    if show:
        print("R-squared: {:.2f}".format(r2))
        print("Mean Squared Error (MSE): {:.2f}".format(mse))
        print("Median Absolute Error: £{:.2f} per night".format(med_ae))
        print("Mean Absolute Error: £{:.2f} per night".format(mae))
    return {
        "R-squared": f"{r2:.3f}",
        "Median Absolute Error": f"£{med_ae:.2f} per night",
        "Mean Absolute Error": f"£{mae:.2f} per night",
        "Mean Squared Error": f"{mse:.2f}",
    }
def get_mi_score(X, y):
    mi = mutual_info_regression(X, y, random_state=10)
    mi = pd.Series(mi, index=X.columns).sort_values(ascending=False)
    return mi
def calculate_residuals(model, X, y):
    """
    Creates predictions on the features with the model and calculates residuals
    """
    predictions = model.predict(X)
    df_results = pd.DataFrame({'Actual': y, 'Predicted': predictions})
    # residuals are actual minus predicted (not the difference of absolute values)
    df_results['Residuals'] = df_results['Actual'] - df_results['Predicted']
    return df_results
def plot_model_performance_scatter(
        y, y_pred1, y_pred2, title1, title2, suptitle,
        kind='actual_vs_predicted', lower_lim=-200, upper_lim=1500,
        interval=100, figsize=(12, 6)):
    """
    Compare the performance of 2 models by plotting their predictions vs actual values
    """
    bins = np.arange(lower_lim, upper_lim + interval, interval)
    fig, (ax0, ax1) = plt.subplots(1, 2, sharey=True, figsize=figsize)
    PredictionErrorDisplay.from_predictions(
        y,
        y_pred1,
        kind=kind,
        ax=ax0,
        scatter_kwargs={"alpha": 0.3, 'marker': 'o', 'color': BASE_HIGHLIGHT_INTENSE,
                        's': 50, 'edgecolor': 'black', 'linewidths': 0.3},
    )
    if kind == 'actual_vs_predicted':
        ax0.set_xticks(bins, bins, rotation=90)
        ax0.set_yticks(bins, bins)
        ax0.set_xlim(lower_lim, upper_lim)
        ax0.set_ylim(lower_lim, upper_lim)
    else:
        ax0.set_xticks(bins, bins, rotation=90)
        ax0.set_xlim(lower_lim, upper_lim)
    PredictionErrorDisplay.from_predictions(
        y,
        y_pred2,
        kind=kind,
        ax=ax1,
        scatter_kwargs={"alpha": 0.3, 'marker': 'o', 'color': BASE_HIGHLIGHT_INTENSE,
                        's': 50, 'edgecolor': 'black', 'linewidths': 0.3},
    )
    if kind == 'actual_vs_predicted':
        ax1.set_xticks(bins, bins, rotation=90)
        ax1.set_yticks(bins, bins)
        ax1.set_xlim(lower_lim, upper_lim)
        ax1.set_ylim(lower_lim, upper_lim)
        loc = 'upper left'
    else:
        ax1.set_xticks(bins, bins, rotation=90)
        ax1.set_xlim(lower_lim, upper_lim)
        loc = 'upper right'
    ax1.yaxis.set_tick_params(labelleft=True)
    # add the scores to the legend of each axis
    for ax, y_pred in zip([ax0, ax1], [y_pred1, y_pred2]):
        for name, score in goodness_of_fit(y, y_pred).items():
            ax.plot([], [], " ", label=f"{name}={score}")
        ax.legend(loc=loc)
    ax0.set_title(title1)
    ax1.set_title(title2)
    fig.suptitle(suptitle)
    plt.show()
def plot_model_performance_hist(y, y_pred1, y_pred2, title1=None, title2=None,
                                suptitle=None, figsize=(12, 12), price_max=1200,
                                price_interval=25):
    """ Compare performance of 2 models by plotting actual vs predicted target with seaborn histogram """
    # actual vs predicted price comparison
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=figsize, sharex=True, sharey=True)
    # create bins
    xbins = np.arange(0, price_max + price_interval, price_interval)
    # plot for prediction from model 1
    sns.histplot(y, bins=xbins, stat='percent', kde=False, color=BASE_COLOR, label='actual', ax=ax1)
    sns.histplot(y_pred1, bins=xbins, stat='percent', kde=False, color=BASE_COMPLIMENTARY, alpha=0.5, label='predicted', ax=ax1)
    ax1.set_title(title1)
    ax1.legend()
    # plot for prediction from model 2
    sns.histplot(y, bins=xbins, stat='percent', kde=False, color=BASE_COLOR, label='actual', ax=ax2)
    sns.histplot(y_pred2, bins=xbins, stat='percent', kde=False, color=BASE_COMPLIMENTARY, alpha=0.5, label='predicted', ax=ax2)
    ax2.set_title(title2)
    ax2.legend()
    plt.xticks(xbins, xbins, rotation=90)
    ax1.xaxis.set_tick_params(labelbottom=True, rotation=90)
    plt.xlim(0, price_max)
    plt.suptitle(suptitle)
    plt.show()
def normal_errors_assumption(model, features, label, ax, p_value_thresh=0.05, title=None):
    """
    Normality: assumes that the error terms are normally distributed,
    i.e. the residuals should have a mean of zero.
    If they are not, nonlinear transformations of variables may solve this.
    Violating this assumption primarily causes issues with the confidence intervals.
    A p-value from the test below the threshold (default 0.05) generally means non-normal.
    """
    # calculate residuals for the Anderson-Darling test
    df_results = calculate_residuals(model, features, label)
    # perform the test on the residuals
    p_value = normal_ad(df_results['Residuals'])[1]
    # report the normality of the residuals
    if p_value < p_value_thresh:
        text = 'Residuals are not normally distributed'
    else:
        text = 'Residuals are normally distributed'
    # plot the residuals distribution on the supplied axis
    ax.set_title(title)
    sns.histplot(df_results['Residuals'], bins=100, stat='density', kde=True, color=BASE_COLOR, ax=ax)
    sns.kdeplot(df_results['Residuals'], color=BASE_HIGHLIGHT_INTENSE, ax=ax, linewidth=2)
    ax.plot([], [], " ", label=f"p-value={p_value:.4f}: {text}")
    ax.legend(loc="upper left")
def distribution_of_errors(model1, model2, X, y, title1=None, title2=None,
                           suptitle=None, alpha=None, figsize=(12, 6), kind='kde'):
    """ Compare the normal distribution of errors for 2 models """
    fig = plt.figure(figsize=figsize)
    if kind == 'kde':
        ax = plt.subplot(1, 2, 1)
        normal_errors_assumption(model=model1, features=X, label=y, ax=ax, title=title1)
        ax2 = plt.subplot(1, 2, 2, sharey=ax)
        normal_errors_assumption(model=model2, features=X, label=y, ax=ax2, title=title2)
    elif kind == 'hist':
        ax1 = plt.subplot(1, 2, 1)
        # use the function arguments rather than global training data,
        # and plot only the residuals column
        x = calculate_residuals(model1, X, y)['Residuals'].values
        plt.hist(x, bins=100, color=BASE_HIGHLIGHT_INTENSE)
        plt.xlim(-1000, 1000)
        ax1.set_title(title1)
        ax1.set_ylabel('Frequency')
        ax1.set_xlabel('Residuals')
        ax2 = plt.subplot(1, 2, 2, sharex=ax1, sharey=ax1)
        x = calculate_residuals(model2, X, y)['Residuals'].values
        sns.histplot(x, bins=100, color=BASE_HIGHLIGHT_INTENSE)
        ax2.set_title(title2)
        ax2.set_ylabel('Frequency')
        ax2.set_xlabel('Residuals')
    plt.suptitle(suptitle, y=1.05)
    plt.tight_layout()
def homoscedasticity_assumption(model, X, y, ax):
    """
    Assumption 5: Homoscedasticity of error terms.
    Assumes that the errors/residuals have a constant variance, with a mean of 0.
    """
    # calculate residuals for the plot
    df_results = calculate_residuals(model, X, y)
    df_results.reset_index(inplace=True)
    sns.scatterplot(x=df_results.index, y=df_results.Residuals, color=BASE_COLOR,
                    marker='o', edgecolor='black', linewidth=0.1, alpha=0.2, ax=ax)
    # reference line at zero residual
    ax.axhline(y=0, color=BASE_COMPLIMENTARY, linestyle='--')
    ax.spines['right'].set_visible(False)  # remove the right spine
    ax.spines['top'].set_visible(False)    # remove the top spine
def residual_variance(model1, model2, X, y, title1=None, title2=None, suptitle=None, figsize=(14, 4)):
    """ Compare homoscedasticity of 2 models """
    plt.figure(figsize=figsize)
    ax1 = plt.subplot(1, 2, 1)
    ax1.set_title(title1)
    ax1.set_xlabel('Index')
    homoscedasticity_assumption(model1, X, y, ax=ax1)
    ax2 = plt.subplot(1, 2, 2)
    ax2.set_title(title2)
    ax2.set_xlabel('Index')
    homoscedasticity_assumption(model2, X, y, ax=ax2)
    plt.suptitle(suptitle, y=1.05)
    plt.show()
def coef_weights(model, logscale=False):
    '''
    INPUT:
    model - a fitted pipeline whose last step is the regressor
    logscale - whether the target was modelled on a log scale

    OUTPUT:
    coefs_df - a dataframe holding the feature name, coefficient, and abs(coefficient)
    intercept - the model intercept

    Provides a dataframe that can be used to understand the most influential coefficients
    in a linear model by providing the coefficient estimates along with the name of the
    variable attached to the coefficient.
    '''
    coefs_df = pd.DataFrame()
    coefs_df['est_int'] = model[:-1].get_feature_names_out()
    try:
        coefs_df['coefs'] = model[-1].coef_
        intercept = model[-1].intercept_
    except AttributeError:
        # TransformedTargetRegressor wraps the fitted model in regressor_
        coefs_df['coefs'] = model[-1].regressor_.coef_
        intercept = model[-1].regressor_.intercept_
    if logscale:
        intercept = inverse_log_transform(intercept)
        coefs_df['coefs'] = inverse_log_transform(coefs_df['coefs'])
    coefs_df['abs_coefs'] = abs(coefs_df['coefs'])
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)
    coefs_df = coefs_df.set_index('est_int')
    return coefs_df, intercept
def plot_coef(model1, model2, title1, title2, suptitle, figsize=(12, 20), logscale1=False, logscale2=False):
    # plot for model 1
    ax1 = plt.subplot(1, 2, 1)
    coef_df, intercept = coef_weights(model1, logscale1)
    coef_df['coefs'].plot.barh(figsize=figsize, color=BASE_COLOR, ax=ax1)
    plt.title(title1 + '\n Intercept: £{:.0f}'.format(intercept))
    plt.xlabel("Raw coefficient values")
    # on a log scale the neutral reference line sits at 1, otherwise at 0
    start = 1 if logscale1 else 0
    plt.axvline(x=start, color=".5")
    plt.ylabel('Independent features')
    plt.subplots_adjust(left=0.4)
    for c in ax1.containers:
        labels = [f"{x:,.3}" for x in c.datavalues]
        ax1.bar_label(c, labels=labels, label_type='edge', fontsize=6, padding=3)
    ax1.margins(x=0.2)
    # plot for model 2
    ax2 = plt.subplot(1, 2, 2)
    coef_df, intercept = coef_weights(model2, logscale2)
    coef_df['coefs'].plot.barh(figsize=figsize, color=BASE_COLOR, ax=ax2)
    plt.title(title2 + '\n Intercept: £{:.0f}'.format(intercept))
    plt.xlabel("Raw coefficient values")
    start = 1 if logscale2 else 0
    plt.axvline(x=start, color=".5")
    plt.ylabel('Independent features')
    plt.subplots_adjust(left=0.4)
    for c in ax2.containers:
        labels = [f"{x:,.3}" for x in c.datavalues]
        ax2.bar_label(c, labels=labels, label_type='edge', fontsize=6, padding=3)
    ax2.margins(x=0.2)
    plt.suptitle(suptitle, y=1.01)
    plt.tight_layout()
    return coef_df
def coefficient_variability(model, X, y, title, suptitle, splits=5, repeats=20, jobs=4,
                            random_state=0, scoring='r2', figsize=(10, 12)):
    # stratified splitting is not defined for a continuous target, so use RepeatedKFold
    from sklearn.model_selection import RepeatedKFold

    feature_names = model[:-1].get_feature_names_out()
    cv = RepeatedKFold(n_splits=splits, n_repeats=repeats, random_state=random_state)
    cv_model = cross_validate(
        model,
        X,
        y,
        scoring=scoring,
        cv=cv,
        return_estimator=True,
        n_jobs=jobs,
        return_train_score=True,
        verbose=0,
    )
    try:
        coefs = pd.DataFrame(
            [est[-1].regressor_.coef_ for est in cv_model["estimator"]], columns=feature_names,
        )
    except AttributeError:
        coefs = pd.DataFrame(
            [est[-1].coef_ for est in cv_model["estimator"]], columns=feature_names,
        )
    plt.figure(figsize=figsize)
    sns.stripplot(data=coefs, orient="h", palette="dark:k", alpha=0.5)
    sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5, whis=10)
    plt.axvline(x=0, color=".5")
    plt.xlabel("Coefficient importance")
    plt.title(title)
    plt.suptitle(suptitle, y=1.01, x=0.65)
    plt.subplots_adjust(left=0.3)
    plt.tight_layout()
    return cv_model
def compare_qqplot(model1, model2, X, y, title1, title2, suptitle, figsize=(8, 4)):
    """ Compare QQ plot for 2 models against normal distribution """
    fig, (ax, ax2) = plt.subplots(1, 2, figsize=figsize, sharey=True, sharex=True)
    residuals = calculate_residuals(model1, X, y)
    sm.qqplot(data=residuals['Residuals'], line='q', ax=ax, marker='.', markerfacecolor=BASE_COLOR, markeredgecolor=BASE_COLOR)
    residuals = calculate_residuals(model2, X, y)
    sm.qqplot(data=residuals['Residuals'], line='q', ax=ax2, marker='.', markerfacecolor=BASE_COLOR, markeredgecolor=BASE_COLOR)
    ax.set_title(title1)
    ax2.set_title(title2)
    plt.suptitle(suptitle, y=1.05)
    plt.show()
# build a final cleaning function
def cleaning(df, cluster_preprocessor, cluster_model, onehot_preprocessor):
    # cluster preprocessing for only the relevant columns
    # (use the passed-in df, not a global dataframe)
    X_property = df[['room_type', 'property_type', 'latitude', 'longitude', 'price_mean']].copy(deep=True)
    X_property_transformed = cluster_preprocessor.transform(X_property)
    # cluster data with the already fitted model
    X_property_transformed['cluster'] = cluster_model.predict(X_property_transformed)
    # rank the clusters by price so they can be renamed in sequential order
    tmp = X_property_transformed.groupby('cluster').agg(
        min=('price_mean', 'min'), max=('price_mean', 'max'),
        nr_listings=('cluster', 'count')).sort_values(by='min')
    tmp['rank'] = tmp['min'].rank(method='max').astype('int')
    # map each cluster to its price rank
    mymap = pd.Series(tmp['rank'].values, index=tmp.index).to_dict()
    X_property_transformed['property_cluster'] = X_property_transformed['cluster'].map(mymap)
    # add the cluster to the main dataset
    df['property_cluster'] = X_property_transformed['property_cluster']
    # drop lat/long after clustering
    df = df.drop(['latitude', 'longitude'], axis=1)
    # split the dataset into dependent and independent datasets
    y = df['price_mean'].copy()
    X = df.drop('price_mean', axis=1)
    # final one-hot encoding for all features, dropping sparse columns
    X_transformed = onehot_preprocessor.transform(X)
    # stratify so property types are well represented in both training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X_transformed,
        y,
        test_size=0.25,
        random_state=random_state,
        stratify=X_transformed[['property_cluster', 'is_business_True']],
    )
    return X_train, X_test, y_train, y_test
# features to be used in modelling
main_cols = ['host_is_superhost', 'accommodates', 'bedrooms', 'bathrooms',
'minimum_minimum_nights', 'calculated_host_listings_count', 'host_years_active', 'price_mean',
'private_bathroom', 'host_same_location', 'is_business', 'availability_90', 'neighbourhood_cleansed',
'property_type', 'room_type', 'review_scores_location', 'review_scores_value', 'latitude', 'longitude']
# load the enhanced listing file
listing = pd.read_pickle(FILE_NAME_ENHANCED)
listing.head(3)
| listing_id | listing_url | name | host_id | host_name | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_neighbourhood | host_has_profile_pic | host_identity_verified | neighbourhood_cleansed | latitude | longitude | property_type | room_type | accommodates | bathrooms | bedrooms | beds | price | minimum_minimum_nights | minimum_maximum_nights | availability_30 | availability_60 | availability_90 | availability_365 | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | first_review | last_review | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | host_has_about | host_has_neighborhood_overview | host_years_active | price_mean | private_bathroom | host_same_location | is_business | minimum_night_ranges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 157714 | https://www.airbnb.com/rooms/157714 | Townhouse in Hammersmith · ★4.71 · 1 bedroom ·... | 757377 | Eilinn | within a few hours | 90 | 68 | False | Hammersmith | True | True | Hammersmith and Fulham | 51.48974 | -0.22208 | Private room in townhouse | Private room | 3 | 1.5 | 1.0 | 1.0 | 69 | 4 | 21 | 10 | 37 | 67 | 67 | 175 | 10 | 0 | 2011-07-13 | 2023-05-02 | 4.71 | 4.66 | 4.78 | 4.84 | 4.83 | 4.76 | 4.66 | False | 1 | 0 | 1 | 0 | 1.21 | True | True | 12 | 70 | True | True | False | 7 |
| 1 | 13913 | https://www.airbnb.com/rooms/13913 | Rental unit in Islington · ★4.80 · 1 bedroom ·... | 54730 | Alina | within an hour | 100 | 78 | False | LB of Islington | True | True | Islington | 51.56861 | -0.11270 | Private room in rental unit | Private room | 1 | 1.0 | 1.0 | 1.0 | 79 | 1 | 29 | 23 | 48 | 78 | 353 | 41 | 15 | 0 | 2010-08-18 | 2022-12-11 | 4.80 | 4.72 | 4.72 | 4.74 | 4.82 | 4.69 | 4.69 | False | 2 | 1 | 1 | 0 | 0.26 | True | True | 14 | 79 | False | True | False | 7 |
| 2 | 15400 | https://www.airbnb.com/rooms/15400 | Rental unit in London · ★4.80 · 1 bedroom · 1 ... | 60302 | Philippa | within a day | 100 | 47 | False | Chelsea | True | True | Kensington and Chelsea | 51.48780 | -0.16813 | Entire rental unit | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | 100 | 7 | 30 | 0 | 2 | 2 | 70 | 94 | 5 | 0 | 2009-12-21 | 2023-05-01 | 4.80 | 4.85 | 4.88 | 4.88 | 4.83 | 4.93 | 4.74 | False | 1 | 1 | 0 | 0 | 0.57 | True | True | 14 | 107 | True | True | False | 7 |
# get neighborhood geoshapes for spatial analysis
crs_4326 = CRS("WGS84")
neighbourhood = gpd.read_file('../data/raw/neighbourhood_geojson/london/neighbourhoods_060623.geojson', crs=crs_4326)
neighbourhood.head()
| neighbourhood | neighbourhood_group | geometry | |
|---|---|---|---|
| 0 | Kingston upon Thames | None | MULTIPOLYGON (((-0.330679000000000 51.32901100... |
| 1 | Croydon | None | MULTIPOLYGON (((-0.064021000000000 51.31863800... |
| 2 | Bromley | None | MULTIPOLYGON (((0.012131000000000 51.299599000... |
| 3 | Hounslow | None | MULTIPOLYGON (((-0.244562000000000 51.48870200... |
| 4 | Ealing | None | MULTIPOLYGON (((-0.411833000000000 51.53408400... |
neighbourhood.shape
(33, 3)
listing_clean = listing.copy()
# make sure that each record is a unique listing
listing_clean['listing_id'].duplicated().sum()
0
# for easy troubleshooting, let's make listing_id the index
listing_clean = listing_clean.set_index('listing_id')
listing_clean = listing_clean.loc[:, main_cols]
listing_clean.head()
| host_is_superhost | accommodates | bedrooms | bathrooms | minimum_minimum_nights | calculated_host_listings_count | host_years_active | price_mean | private_bathroom | host_same_location | is_business | availability_90 | neighbourhood_cleansed | property_type | room_type | review_scores_location | review_scores_value | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| listing_id | |||||||||||||||||||
| 157714 | False | 3 | 1.0 | 1.5 | 4 | 1 | 12 | 70 | True | True | False | 67 | Hammersmith and Fulham | Private room in townhouse | Private room | 4.76 | 4.66 | 51.48974 | -0.22208 |
| 13913 | False | 1 | 1.0 | 1.0 | 1 | 2 | 14 | 79 | False | True | False | 78 | Islington | Private room in rental unit | Private room | 4.69 | 4.69 | 51.56861 | -0.11270 |
| 15400 | False | 2 | 1.0 | 1.0 | 7 | 1 | 14 | 107 | True | True | False | 2 | Kensington and Chelsea | Entire rental unit | Entire home/apt | 4.93 | 4.74 | 51.48780 | -0.16813 |
| 306333 | False | 5 | 2.0 | 1.0 | 2 | 1 | 11 | 198 | True | False | False | 4 | Hackney | Entire rental unit | Entire home/apt | 4.89 | 4.53 | 51.52748 | -0.08172 |
| 159736 | False | 2 | 1.0 | 1.0 | 4 | 4 | 12 | 62 | False | False | False | 2 | Lambeth | Private room in rental unit | Private room | 4.34 | 4.66 | 51.46788 | -0.09993 |
# make sure features that will be one-hot encoded have no spaces in their values, as the values become part of column headings
categorical_columns = ['neighbourhood_cleansed', 'property_type', 'room_type']
listing_clean[categorical_columns] = listing_clean[categorical_columns].apply(lambda x: x.str.replace(' ', '_').str.lower())
listing_clean[categorical_columns] = listing_clean[categorical_columns].apply(lambda x: x.str.replace('/', '_').str.lower())
neighbourhood['neighbourhood'] = neighbourhood['neighbourhood'].apply(lambda x: x.replace(' ', '_').lower())
neighbourhood['neighbourhood'] = neighbourhood['neighbourhood'].apply(lambda x: x.replace('/', '_').lower())
# check for outliers
listing_clean.describe([0.025, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975, 0.99, 0.999])
| accommodates | bedrooms | bathrooms | minimum_minimum_nights | calculated_host_listings_count | host_years_active | price_mean | availability_90 | review_scores_location | review_scores_value | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 68923.000000 | 68867.000000 | 68842.000000 | 68923.000000 | 68923.000000 | 68923.000000 | 68923.000000 | 68923.000000 | 47720.000000 | 47719.000000 | 68923.000000 | 68923.000000 |
| mean | 3.168014 | 1.475975 | 1.338630 | 4.350202 | 19.274059 | 6.251208 | 202.492608 | 33.066770 | 4.709157 | 4.553347 | 51.509690 | -0.129571 |
| std | 1.985088 | 0.962448 | 0.676587 | 9.065348 | 62.684541 | 3.230126 | 240.753573 | 31.683168 | 0.441736 | 0.564701 | 0.048800 | 0.101330 |
| min | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 51.295937 | -0.497800 |
| 2.5% | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 30.000000 | 0.000000 | 3.670000 | 3.000000 | 51.398398 | -0.344240 |
| 25% | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 70.000000 | 0.000000 | 4.620000 | 4.430000 | 51.484790 | -0.190558 |
| 50% | 2.000000 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 7.000000 | 129.000000 | 26.000000 | 4.830000 | 4.700000 | 51.513730 | -0.129850 |
| 75% | 4.000000 | 2.000000 | 1.500000 | 4.000000 | 7.000000 | 9.000000 | 232.000000 | 61.000000 | 5.000000 | 4.900000 | 51.538970 | -0.069360 |
| 90% | 6.000000 | 3.000000 | 2.000000 | 7.000000 | 40.000000 | 10.000000 | 409.000000 | 83.000000 | 5.000000 | 5.000000 | 51.566208 | -0.011270 |
| 95% | 7.000000 | 3.000000 | 2.500000 | 14.000000 | 96.000000 | 11.000000 | 619.000000 | 89.000000 | 5.000000 | 5.000000 | 51.588314 | 0.032107 |
| 97.5% | 8.000000 | 4.000000 | 3.000000 | 28.000000 | 224.000000 | 12.000000 | 990.000000 | 90.000000 | 5.000000 | 5.000000 | 51.603030 | 0.085716 |
| 99% | 10.000000 | 5.000000 | 3.500000 | 45.000000 | 291.000000 | 12.000000 | 1507.780000 | 90.000000 | 5.000000 | 5.000000 | 51.624190 | 0.141005 |
| 99.9% | 16.000000 | 7.000000 | 5.500000 | 90.000000 | 543.000000 | 14.000000 | 1791.078000 | 90.000000 | 5.000000 | 5.000000 | 51.660925 | 0.217254 |
| max | 16.000000 | 50.000000 | 50.000000 | 90.000000 | 543.000000 | 15.000000 | 2000.000000 | 90.000000 | 5.000000 | 5.000000 | 51.681502 | 0.288570 |
# remove final outliers
listing_clean = listing_clean[listing_clean['accommodates'] <= 8]
listing_clean = listing_clean[listing_clean['bedrooms'] <= 5]
listing_clean = listing_clean[listing_clean['bathrooms'] <= 4]
listing_clean = listing_clean[listing_clean['price_mean'].between(30, 1500)]
listing_clean.describe([0.025, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975, 0.99, 0.999])
| accommodates | bedrooms | bathrooms | minimum_minimum_nights | calculated_host_listings_count | host_years_active | price_mean | availability_90 | review_scores_location | review_scores_value | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 65181.000000 | 65181.000000 | 65181.000000 | 65181.000000 | 65181.000000 | 65181.000000 | 65181.000000 | 65181.000000 | 45741.000000 | 45741.000000 | 65181.000000 | 65181.000000 |
| mean | 3.069054 | 1.423175 | 1.300916 | 4.327473 | 16.692380 | 6.233703 | 186.209616 | 33.416210 | 4.714553 | 4.557691 | 51.509394 | -0.130231 |
| std | 1.683950 | 0.819303 | 0.530518 | 9.078507 | 52.108377 | 3.255960 | 186.584035 | 31.544086 | 0.434996 | 0.557968 | 0.048391 | 0.100523 |
| min | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 30.000000 | 0.000000 | 0.000000 | 0.000000 | 51.295937 | -0.497800 |
| 2.5% | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 35.000000 | 0.000000 | 3.670000 | 3.000000 | 51.398645 | -0.343115 |
| 25% | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 73.000000 | 0.000000 | 4.630000 | 4.430000 | 51.484850 | -0.190697 |
| 50% | 2.000000 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 7.000000 | 129.000000 | 27.000000 | 4.840000 | 4.710000 | 51.513620 | -0.130720 |
| 75% | 4.000000 | 2.000000 | 1.500000 | 4.000000 | 7.000000 | 9.000000 | 225.000000 | 61.000000 | 5.000000 | 4.900000 | 51.538110 | -0.070180 |
| 90% | 6.000000 | 3.000000 | 2.000000 | 7.000000 | 36.000000 | 10.000000 | 382.000000 | 83.000000 | 5.000000 | 5.000000 | 51.564930 | -0.012780 |
| 95% | 6.000000 | 3.000000 | 2.500000 | 14.000000 | 83.000000 | 11.000000 | 541.000000 | 89.000000 | 5.000000 | 5.000000 | 51.587160 | 0.029320 |
| 97.5% | 7.000000 | 4.000000 | 3.000000 | 28.000000 | 177.000000 | 12.000000 | 752.000000 | 90.000000 | 5.000000 | 5.000000 | 51.602310 | 0.082670 |
| 99% | 8.000000 | 4.000000 | 3.000000 | 45.000000 | 289.000000 | 12.000000 | 1000.000000 | 90.000000 | 5.000000 | 5.000000 | 51.623584 | 0.138979 |
| 99.9% | 8.000000 | 5.000000 | 4.000000 | 90.000000 | 543.000000 | 14.000000 | 1412.000000 | 90.000000 | 5.000000 | 5.000000 | 51.660162 | 0.214168 |
| max | 8.000000 | 5.000000 | 4.000000 | 90.000000 | 543.000000 | 15.000000 | 1500.000000 | 90.000000 | 5.000000 | 5.000000 | 51.681502 | 0.278960 |
# correct datatypes
listing_clean['neighbourhood_cleansed'] = listing_clean['neighbourhood_cleansed'].astype('category')
listing_clean['room_type'] = listing_clean['room_type'].astype('category')
listing_clean['property_type'] = listing_clean['property_type'].astype('category')
listing_clean['bedrooms'] = listing_clean['bedrooms'].astype(int)
listing_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 65181 entries, 157714 to 908453389034894825
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   host_is_superhost               65181 non-null  bool
 1   accommodates                    65181 non-null  int64
 2   bedrooms                        65181 non-null  int64
 3   bathrooms                       65181 non-null  float64
 4   minimum_minimum_nights          65181 non-null  int64
 5   calculated_host_listings_count  65181 non-null  int64
 6   host_years_active               65181 non-null  int64
 7   price_mean                      65181 non-null  int64
 8   private_bathroom                65181 non-null  bool
 9   host_same_location              65181 non-null  bool
 10  is_business                     65181 non-null  bool
 11  availability_90                 65181 non-null  int64
 12  neighbourhood_cleansed          65181 non-null  category
 13  property_type                   65181 non-null  category
 14  room_type                       65181 non-null  category
 15  review_scores_location          45741 non-null  float64
 16  review_scores_value             45741 non-null  float64
 17  latitude                        65181 non-null  float64
 18  longitude                       65181 non-null  float64
dtypes: bool(4), category(3), float64(5), int64(7)
memory usage: 6.9 MB
Let's double-check the target variable's distribution. Is it normally distributed?¶
Skew:
- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical
- If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed
- If the skewness is less than -1 or greater than 1, the data are highly skewed
Kurtosis describes the peakedness of the distribution:
- A tall, thin distribution is called leptokurtic (kurtosis > 3). Values in a leptokurtic distribution sit near the mean or at the extremes.
- A flat distribution where the values are moderately spread out is called platykurtic (kurtosis < 3).
- A distribution whose shape falls between a leptokurtic and a platykurtic distribution is called mesokurtic (kurtosis = 3). A mesokurtic distribution looks closest to a normal distribution.
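As a quick illustration of these thresholds, skew and kurtosis can be computed directly with pandas on synthetic samples. Note one caveat for the numbers above: pandas' `.kurtosis()` reports *excess* kurtosis, so a normal distribution scores roughly 0 rather than 3.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
normal = pd.Series(rng.normal(size=100_000))            # symmetric, mesokurtic
lognormal = pd.Series(rng.lognormal(sigma=1.0, size=100_000))  # highly right-skewed

# pandas reports excess kurtosis: ~0 for a normal distribution
print(f"normal:    skew={normal.skew():.2f}, excess kurtosis={normal.kurtosis():.2f}")
print(f"lognormal: skew={lognormal.skew():.2f}, excess kurtosis={lognormal.kurtosis():.2f}")
```

The lognormal sample lands far beyond the |skew| > 1 "highly skewed" threshold, much like an untransformed price column.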
plt.figure(figsize=(18,4))
plt.subplot(1,3,1)
xbinsize = 25
xbins = np.arange(0, listing_clean['price_mean'].max()+xbinsize, xbinsize)
ax1 = sns.histplot(data=listing_clean, x='price_mean', bins=xbins, kde=True, color=BASE_COLOR_SUB1)
plt.axvline(x=listing_clean['price_mean'].mean(), color='r', linestyle=':')
xticks=ax1.get_xticks()
labels = ['£{:1.0f}'.format(tick) for tick in xticks]
plt.xticks(xticks, labels)
plt.title('Untransformed distribution')
plt.xlabel('Mean price (per night)')
plt.ylabel('Frequency')
plt.subplot(1,3,2)
df = pd.DataFrame()
df['price_mean'], fitted_lambda = stats.boxcox(listing_clean['price_mean'])
ax2 = sns.histplot(data=df, x='price_mean', bins=100, color=BASE_COLOR_SUB2, kde=True)
plt.axvline(x=df['price_mean'].mean(), color='r', linestyle=':')
xticks=ax2.get_xticks()
labels = ['£{:1.0f}'.format(inv_boxcox(tick, fitted_lambda)) for tick in xticks]
plt.xticks(xticks,labels)
plt.title('Boxcox distribution')
plt.xlabel('Mean price (per night) in pounds')
plt.ylabel('Frequency')
plt.subplot(1,3,3)
df2 = pd.DataFrame()
df2['price_mean'] = log_transform(listing_clean['price_mean'])
ax3 = sns.histplot(data=df2, x='price_mean', bins=100, color=BASE_COLOR_SUB2, kde=True)
plt.axvline(x=df2['price_mean'].mean(), color='r', linestyle=':')
xticks=ax3.get_xticks()
labels = ['£{:1.0f}'.format(inverse_log_transform(tick)) for tick in xticks]
plt.xticks(xticks,labels)
plt.title('Log10 distribution')
plt.xlabel('Mean price (per night) in pounds')
plt.ylabel('Frequency')
plt.suptitle('Compare distributions of mean price per night (in pounds)')
# Add the skew and kurtosis score in the legend of each axis
# (use a fresh loop name to avoid shadowing the boxcox DataFrame `df` created above)
for ax, series in zip([ax1, ax2, ax3], [listing_clean['price_mean'], df['price_mean'], df2['price_mean']]):
    skew = round(series.skew(), 3)
    kurtosis = round(series.kurtosis(), 3)
    ax.plot([], [], " ", label=f"skew={skew} \nkurtosis={kurtosis}")
    ax.legend(loc='upper right')
plt.show()
Observation: Mean price is approximately normally distributed on both the Box-Cox and log10 scales. The log10 scale shows a higher density around the mean.
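The helpers `log_transform` and `inverse_log_transform` used above are defined in an earlier notebook. A minimal sketch of what they are assumed to do (a plain base-10 log and its inverse; the actual helpers may differ):

```python
import numpy as np

def log_transform(x):
    """Base-10 log transform; assumes strictly positive prices."""
    return np.log10(x)

def inverse_log_transform(x):
    """Invert the base-10 transform back to pounds."""
    return np.power(10.0, x)

prices = np.array([70.0, 79.0, 107.0])
round_trip = inverse_log_transform(log_transform(prices))
print(round_trip)  # recovers the original prices
```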
Which transformation strategy should we use?¶
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(221)
xt = listing_clean['price_mean']
stats.probplot(xt, dist=stats.norm, plot=ax1)
ax1.set_xlabel('')
ax1.set_title('Probability plot WITHOUT price transformation')
ax1.get_lines()[0].set_color(BASE_COLOR)
ax1.get_lines()[1].set_color(BASE_COMPLIMENTARY)
ax2 = fig.add_subplot(222)
xt, fitted_lambda = stats.boxcox(listing_clean['price_mean'])
stats.probplot(xt, dist=stats.norm, plot=ax2)
ax2.set_title('Probability plot WITH price BOXCOX transformation')
ax2.set_xlabel('')
yticks = ax2.get_yticks()
labels = ['{:1.0f}'.format(inv_boxcox(tick, fitted_lambda)) for tick in yticks]
plt.yticks(yticks,labels)
ax2.get_lines()[0].set_color(BASE_COLOR)
ax2.get_lines()[1].set_color(BASE_COMPLIMENTARY)
ax3 = fig.add_subplot(223)
xt = np.log1p(listing_clean['price_mean'])
stats.probplot(xt, dist=stats.norm, plot=ax3)
ax3.set_title('Probability plot WITH price LOG1P transformation')
yticks = ax3.get_yticks()
# np.expm1 is the exact inverse of the np.log1p transform applied above
labels = ['{:1.0f}'.format(np.expm1(tick)) for tick in yticks]
plt.yticks(yticks,labels)
ax3.get_lines()[0].set_color(BASE_COLOR)
ax3.get_lines()[1].set_color(BASE_COMPLIMENTARY)
ax4 = fig.add_subplot(224)
xt = log_transform(listing_clean['price_mean'])
stats.probplot(xt, dist=stats.norm, plot=ax4)
ax4.set_title('Probability plot WITH price LOG10 transformation')
yticks = ax4.get_yticks()
labels = ['{:1.0f}'.format(inverse_log_transform(tick)) for tick in yticks]
plt.yticks(yticks,labels)
ax4.get_lines()[0].set_color(BASE_COLOR)
ax4.get_lines()[1].set_color(BASE_COMPLIMENTARY)
plt.tight_layout()
Let's double-check the distribution of independent variables¶
# we want different approaches for numeric vs categorical features
numerical_features = listing_clean.select_dtypes(include='number').columns.tolist()
numerical_features.remove('price_mean')
numerical_features.remove('latitude')
numerical_features.remove('longitude')
categorical_features = listing_clean.select_dtypes(exclude='number').columns.tolist()
# distribution of numerical columns
pd.plotting.hist_frame(listing_clean, figsize=(20,8), bins=30, layout=(3,4), column=numerical_features, xlabelsize=8, ylabelsize=8, color=BASE_COLOR)
plt.suptitle('Distribution of numerical columns')
plt.show()
# plot distributions for categorical features
for col in categorical_features:
vs.hist_by_cat(df=listing_clean, col=col, topn=30)
Correlation analysis¶
- Select features correlated with price
- Make sure independent variables do not correlate too strongly with each other
df_corr = listing_clean.corr(method='spearman', numeric_only=True, min_periods=1)
sns.clustermap(df_corr, cmap="Spectral", center=0, annot=True, fmt='.2f',
annot_kws={"fontsize":8}, figsize=(10,10))
plt.title('Feature correlation')
plt.xticks(fontsize=10)
plt.show()
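Beyond the clustermap, the multicollinearity check can be made explicit by listing feature pairs whose absolute correlation exceeds a threshold. A small helper sketch (the 0.8 threshold is an arbitrary choice and the toy frame is purely illustrative):

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8, method='spearman'):
    """Return (feature_a, feature_b, corr) for pairs above the threshold."""
    corr = df.corr(method=method, numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    return pairs

# Toy check: accommodates and bedrooms are deliberately near-collinear
toy = pd.DataFrame({'accommodates': [1, 2, 3, 4, 5, 6],
                    'bedrooms':     [1, 1, 2, 2, 3, 3],
                    'noise':        [5, 1, 4, 2, 6, 3]})
print(high_corr_pairs(toy))
```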
What kind of relationship do the features have with price?¶
# Get the strongest correlation of features with mean price
corr_matrix = listing_clean[numerical_features].corrwith(listing_clean["price_mean"]).sort_values()
# Plot each feature's correlation with mean price
palette = sns.color_palette("viridis_r", 15)
pos = 'cadetblue'
neg = 'pink'
corr_matrix.plot(kind='bar', color=np.vectorize({True: pos, False: neg}.get)(corr_matrix > 0))
plt.xticks(rotation='vertical')
plt.ylabel('correlation')
plt.xlabel('features')
plt.title('Feature correlation with average listing price')
plt.show()
# Display relationship between price and numerical features
plt.figure(figsize=(14, 8))
for i, col in enumerate(numerical_features):
plt.subplot(3, 3, i+1)
sns.scatterplot(x=listing_clean[col], y=listing_clean['price_mean'], alpha=0.1, color=BASE_HIGHLIGHT_INTENSE, edgecolor='black', linewidth=0.05)
plt.title(col)
plt.xlabel(col)
plt.ylabel('prices')
plt.suptitle('Relationship between price and numerical features', fontsize=14, y=1.05)
plt.tight_layout()
# Do the features have a linear relationship with price?
listing_clean['price_group'] = np.where(listing_clean['price_mean'] <= 500, 0, 1)
sns.set(style='ticks')
x_vars = numerical_features
for col in x_vars:
ax = sns.lmplot(data=listing_clean, x=col, y='price_mean', hue='price_group', x_jitter=0.1, palette=[BASE_COLOR, BASE_COMPLIMENTARY],
scatter_kws={'s':15, 'edgecolor':'black', 'linewidths':0.2}, height=3.5)
ax = plt.gca()
ax.set_title('{} price group trend'.format(col))
plt.show()
# reset default plot formatting back as PairGrid change the defaults somehow
# BASE_COLOR, BASE_HIGHLIGHT_INTENSE, BASE_HIGHLIGHT, BASE_COMPLEMENTRY, BASE_GREY, BASE_COLOR_ARR, BASE_COLOR_DEP, SYMBOLS = vs.set_plot_defaults()
Observation: A linear relationship can be observed in most cases; however, the slope of the fit changes for prices below and above £500, especially for the number of bedrooms, bathrooms and the number of people that can be accommodated.
Interaction analysis¶
Do features like neighbourhood and room type change the relationship some features have with price ?
# Interaction with neighbourhood
cat = listing_clean.groupby('neighbourhood_cleansed')['price_mean'].mean().sort_values(ascending=False)
hue_order = list(cat.index)
for col in ['bedrooms', 'bathrooms', 'accommodates', 'host_is_superhost', 'is_business']:
fig = plt.figure(figsize=(16,4))
plt.subplot(1,2,1)
sns.pointplot(data=listing_clean, x=col, y="price_mean", color=BASE_COLOR)
plt.subplot(1,2,2)
sns.pointplot(data=listing_clean, x=col, y="price_mean", hue='neighbourhood_cleansed', hue_order=hue_order[:4],
palette='Spectral_r')
plt.suptitle('{} interaction with neighbourhood'.format(col))
plt.show()
Observation: Adding features like neighbourhood and room type indeed introduces many interactions. Linear patterns can still be observed, but the slope of the line changes for each neighbourhood.
def custom(y, **kwargs):
ym = y.mean()
plt.axhline(ym, color=BASE_COMPLIMENTARY, linestyle="dashed")
plt.annotate(f"mean: £{y.mean():.0f}", xy=(1,ym),
xycoords=plt.gca().get_yaxis_transform(), ha="right", fontsize=12, color='r')
for col in ['neighbourhood_cleansed', 'room_type', 'property_type']:
avg_price = listing_clean['price_mean'].mean()
cat = listing_clean.groupby(col, observed=False)['price_mean'].mean().sort_values(ascending=False)
order = list(cat.index)
plt.figure(figsize=(16,6))
ax = sns.pointplot(data=listing_clean, x=col, y='price_mean', errorbar='sd', marker='o', order=order, color=BASE_HIGHLIGHT_INTENSE)
ax.axhline(avg_price, color=BASE_COMPLIMENTARY, linestyle="dashed", linewidth=2)
plt.title('Average price per night by {}'.format(col))
plt.ylabel('Average price (per night) in pounds')
plt.xlabel(col)
plt.xticks(rotation=90)
yticks = np.arange(0, 1300, 100)
ylabels = ['£{:1.0f}'.format(tick) for tick in yticks]
plt.yticks(yticks, ylabels)
plt.show()
Observation:
Neighbourhood
- There is considerable variation in price between neighbourhoods, with the widest variation observed in Westminster and Kensington and Chelsea
- Only the top 6 neighbourhoods show mean prices above £200 per night; all other neighbourhoods have prices below £200
Room type
- Entire homes and hotels show mean prices above £200 per night; all other room types have prices below £200
Property type
- Around 10 property types have extreme mean prices above £300 per night, with lots of variability, e.g. entire villas and townhouses
Let's explore a bit deeper the relationship between price and each type of room¶
def custom(y, **kwargs):
ym = y.mean()
plt.axhline(ym, color=BASE_COMPLIMENTARY, linestyle="dashed")
plt.annotate(f"mean: £{y.mean():.0f}", xy=(1,ym),
xycoords=plt.gca().get_yaxis_transform(), ha="right", fontsize=12, color='r', weight='bold')
g = sns.FacetGrid(data = listing_clean, row = 'room_type', height=5, aspect=3, sharex=True, sharey=True)
g.map(sns.pointplot, 'property_type', 'price_mean', order=order, errorbar=('pi', 97.5), seed=88, linestyles=None, linewidth=2,
err_kws={'linewidth': 1}, color=BASE_COLOR)
g = g.map(custom, 'price_mean')
plt.xticks(rotation=90)
plt.tight_layout()
Observation:
- Hotel rooms and entire homes or apartments show prices above £200 per night, whilst shared rooms and private rooms show prices below £100 per night
Let's use clustering to build a better price grouping¶
We have over 80 property types, which could make the model too complex after one-hot encoding and cause overfitting. We could use room type instead, but with only 4 categories and so much price variation within each, it is too coarse for modelling.
- Let's use both room type and property type, but cluster them for simplification.
- Let's minimize the interactions that location, room type and property type introduce, while removing some of the relationships between independent features such as room type and property type.
# encode selected categorical variables first before clustering
X_property = listing_clean[['room_type', 'property_type', 'latitude', 'longitude', 'price_mean']].copy(deep=True)
onehot_cluster = Pipeline(steps=[
('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int'))])
cluster_preprocessor = ColumnTransformer(transformers=[
('category', onehot_cluster, ['property_type', 'room_type']),
],
verbose_feature_names_out=False,
remainder = 'passthrough'
)
_ = cluster_preprocessor.fit(X_property).set_output(transform="pandas")
X_property_transformed = cluster_preprocessor.transform(X_property)
X_property_transformed.head(2)
(Wide output truncated: `X_property_transformed` contains roughly 100 one-hot columns for `property_type` and `room_type`, plus the passthrough columns `latitude`, `longitude` and `price_mean`. The first two rows are listings 157714 and 13913.)
Let's use the Gaussian mixture model (GMM) algorithm for clustering, as it looks for clusters with a normal price distribution within each cluster, which should give a better split.
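A quick illustration, on simulated prices rather than the listing data, of how a GMM recovers normally distributed price bands:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(88)
# Two simulated price bands: budget around £60, upscale around £300
prices = np.concatenate([rng.normal(60, 10, 500),
                         rng.normal(300, 40, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=88)
labels = gmm.fit_predict(prices)

# Each fitted component should recover one of the two price bands
for k in range(2):
    print(f"cluster {k}: mean price ~£{prices[labels == k].mean():.0f}")
```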
# Clustering
gmm = GaussianMixture(n_components=25, random_state=random_state, covariance_type='spherical', init_params='k-means++')
X_property_transformed['cluster'] = gmm.fit_predict(X_property_transformed)
X_property_transformed.head(2)
(Wide output truncated: same columns as above plus the new `cluster` assignment; listing 157714 falls in cluster 0 and listing 13913 in cluster 16.)
# display the min/max prices for each cluster to check if the price groups make sense
tmp = X_property_transformed.groupby('cluster').agg(min=('price_mean', 'min'), max=('price_mean', 'max'), nr_listings=('cluster', 'count')).sort_values(by='min')
tmp['rank'] = tmp['min'].rank(method='max').astype('int')
tmp
| min | max | nr_listings | rank | |
|---|---|---|---|---|
| cluster | ||||
| 17 | 30 | 47 | 6708 | 1 |
| 0 | 48 | 70 | 8896 | 2 |
| 16 | 71 | 91 | 7357 | 3 |
| 9 | 92 | 107 | 4771 | 4 |
| 22 | 108 | 126 | 4559 | 5 |
| 7 | 127 | 145 | 3813 | 6 |
| 18 | 146 | 172 | 5269 | 7 |
| 4 | 173 | 206 | 5592 | 8 |
| 14 | 207 | 236 | 3117 | 9 |
| 20 | 237 | 261 | 2501 | 10 |
| 2 | 262 | 290 | 1890 | 11 |
| 13 | 291 | 331 | 2352 | 12 |
| 8 | 332 | 400 | 2555 | 13 |
| 19 | 401 | 466 | 1373 | 14 |
| 5 | 467 | 516 | 904 | 15 |
| 23 | 517 | 572 | 621 | 16 |
| 10 | 573 | 630 | 563 | 17 |
| 1 | 631 | 722 | 567 | 18 |
| 12 | 723 | 817 | 395 | 19 |
| 3 | 819 | 921 | 313 | 20 |
| 24 | 922 | 1045 | 257 | 21 |
| 11 | 998 | 1001 | 365 | 22 |
| 15 | 1048 | 1250 | 243 | 23 |
| 6 | 1252 | 1412 | 135 | 24 |
| 21 | 1418 | 1500 | 65 | 25 |
# resequence the cluster names by price
mymap = pd.Series(tmp['rank'].values,index=tmp.index).to_dict()
mymap
{17: 1,
0: 2,
16: 3,
9: 4,
22: 5,
7: 6,
18: 7,
4: 8,
14: 9,
20: 10,
2: 11,
13: 12,
8: 13,
19: 14,
5: 15,
23: 16,
10: 17,
1: 18,
12: 19,
3: 20,
24: 21,
11: 22,
15: 23,
6: 24,
21: 25}
# rename clusters to sequence them into ranked order
X_property_transformed['property_cluster'] = X_property_transformed['cluster'].map(mymap)
# add cluster to main dataset and double check correctness
listing_clean['property_cluster'] = X_property_transformed['property_cluster']
listing_clean.head(3)
| host_is_superhost | accommodates | bedrooms | bathrooms | minimum_minimum_nights | calculated_host_listings_count | host_years_active | price_mean | private_bathroom | host_same_location | is_business | availability_90 | neighbourhood_cleansed | property_type | room_type | review_scores_location | review_scores_value | latitude | longitude | price_group | property_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| listing_id | |||||||||||||||||||||
| 157714 | False | 3 | 1 | 1.5 | 4 | 1 | 12 | 70 | True | True | False | 67 | hammersmith_and_fulham | private_room_in_townhouse | private_room | 4.76 | 4.66 | 51.48974 | -0.22208 | 0 | 2 |
| 13913 | False | 1 | 1 | 1.0 | 1 | 2 | 14 | 79 | False | True | False | 78 | islington | private_room_in_rental_unit | private_room | 4.69 | 4.69 | 51.56861 | -0.11270 | 0 | 3 |
| 15400 | False | 2 | 1 | 1.0 | 7 | 1 | 14 | 107 | True | True | False | 2 | kensington_and_chelsea | entire_rental_unit | entire_home_apt | 4.93 | 4.74 | 51.48780 | -0.16813 | 0 | 4 |
Let's use spatial analysis to visualize average prices in neighbourhoods more clearly¶
# make a geo spatial dataset with listing locations
cols = ['latitude', 'longitude', 'price_mean', 'property_cluster', 'neighbourhood_cleansed']
listings_geo = listing_clean[cols].copy(deep=True)
listings_geo.rename(columns={'neighbourhood_cleansed': 'neighbourhood'}, inplace=True)
listings_geo.head()
| latitude | longitude | price_mean | property_cluster | neighbourhood | |
|---|---|---|---|---|---|
| listing_id | |||||
| 157714 | 51.48974 | -0.22208 | 70 | 2 | hammersmith_and_fulham |
| 13913 | 51.56861 | -0.11270 | 79 | 3 | islington |
| 15400 | 51.48780 | -0.16813 | 107 | 4 | kensington_and_chelsea |
| 306333 | 51.52748 | -0.08172 | 198 | 8 | hackney |
| 159736 | 51.46788 | -0.09993 | 62 | 2 | lambeth |
# make geometry field
listings_geo['geometry'] = listings_geo.apply(lambda x: Point((x.longitude, x.latitude)), axis=1)
type(listings_geo)
pandas.core.frame.DataFrame
listings_geo = gpd.GeoDataFrame(listings_geo, crs=crs_4326, geometry=listings_geo.geometry)
listings_geo.head()
| latitude | longitude | price_mean | property_cluster | neighbourhood | geometry | |
|---|---|---|---|---|---|---|
| listing_id | ||||||
| 157714 | 51.48974 | -0.22208 | 70 | 2 | hammersmith_and_fulham | POINT (-0.222080000000000 51.489739999999998) |
| 13913 | 51.56861 | -0.11270 | 79 | 3 | islington | POINT (-0.112700000000000 51.568610000000000) |
| 15400 | 51.48780 | -0.16813 | 107 | 4 | kensington_and_chelsea | POINT (-0.168130000000000 51.487800000000000) |
| 306333 | 51.52748 | -0.08172 | 198 | 8 | hackney | POINT (-0.081720000000000 51.527479999999997) |
| 159736 | 51.46788 | -0.09993 | 62 | 2 | lambeth | POINT (-0.099930000000000 51.467880000000001) |
# calculate average price per neighbourhood
neighbourhood_price = listings_geo.groupby('neighbourhood', as_index=False, observed=False)['price_mean'].mean()
neighbourhood = neighbourhood.merge(neighbourhood_price, on='neighbourhood')
neighbourhood.head()
| neighbourhood | neighbourhood_group | geometry | price_mean | |
|---|---|---|---|---|
| 0 | kingston_upon_thames | None | MULTIPOLYGON (((-0.330679000000000 51.32901100... | 123.725490 |
| 1 | croydon | None | MULTIPOLYGON (((-0.064021000000000 51.31863800... | 100.060284 |
| 2 | bromley | None | MULTIPOLYGON (((0.012131000000000 51.299599000... | 114.579585 |
| 3 | hounslow | None | MULTIPOLYGON (((-0.244562000000000 51.48870200... | 135.320796 |
| 4 | ealing | None | MULTIPOLYGON (((-0.411833000000000 51.53408400... | 128.708951 |
fig, ax = plt.subplots(1, 1, figsize=(16,8))
lgnd_kwds = {'label': 'Average price per neighbourhood',
'orientation': 'vertical'}
base = neighbourhood.plot(ax=ax,
column='price_mean',
edgecolor='black',
cmap='OrRd',
alpha=0.5,
zorder=1,
legend=True,
legend_kwds=lgnd_kwds)
base.set_axis_off()
plt.suptitle('Average price per neighbourhood in Greater London', fontsize=12)
plt.show()
Observation:
- The closer we get to the centre of Greater London, the more expensive the listings get
Visualize the new property clusters¶
def spatial_mapping(data, color):
price_min = data.price_mean.min()
price_max = data.price_mean.max()
ax=plt.gca()
base = data.plot(ax=plt.gca(), zorder=1, marker='o', markersize=40, edgecolor="black", linewidth=1, color=BASE_HIGHLIGHT_INTENSE)
neighbourhood.plot(color='white', edgecolor='black', zorder=2, alpha=0.4, ax=base)
base.set_axis_off()
new_title = ax.get_title() + '\n £{} - £{}'.format(price_min, price_max)
ax.set_title(new_title, fontsize='18')
g = sns.FacetGrid(data=listings_geo, col = 'property_cluster', col_order=[1, 8, 23], height=8, aspect=1)
g.map_dataframe(spatial_mapping)
plt.suptitle('Property price clusters of Greater London', fontsize=20)
plt.tight_layout()
Observation:
- The clusters separate cleanly. Cluster 1 contains budget accommodation, which can be found all over Greater London. As the cluster number increases, prices become more expensive and listings are more concentrated around the centre.
# analyse the relationship between the new property cluster and listing price
plt.figure(figsize=(20, 10))
for i, col in enumerate(['property_cluster']):
plt.subplot(3, 3, i+1)
sns.scatterplot(x=listing_clean[col], y=listing_clean['price_mean'], alpha=0.1, color=BASE_HIGHLIGHT_INTENSE, edgecolor='black', linewidth=0.05)
plt.title('Relationship between average price and {}'.format(col))
plt.xlabel(col)
plt.ylabel('prices')
plt.tight_layout()
Observation:
- The relationship is not perfectly linear, which might cause some problems. The shape is more curved, which leans towards polynomial regression.
- Properties priced above £500 are rare; this is a more luxury market, which might explain the curve.
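Since the relationship looks curved, a polynomial term may help. A minimal sketch, on simulated data rather than the listing set, of how `PolynomialFeatures` captures a quadratic trend that a plain linear fit underfits:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated curved relationship: price grows faster in higher clusters
rng = np.random.default_rng(0)
cluster = rng.integers(1, 26, size=400).astype(float).reshape(-1, 1)
price = 30 + 2 * cluster.ravel() ** 2 + rng.normal(0, 20, size=400)

linear = LinearRegression().fit(cluster, price)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(cluster, price)

# The degree-2 model should explain noticeably more variance
print(f"linear R^2:   {linear.score(cluster, price):.3f}")
print(f"degree-2 R^2: {poly.score(cluster, price):.3f}")
```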
# clear the garbage to free memory
gc.collect()
249
Drop the fields no longer needed for modelling after clustering. Even though room type and location were used during clustering, we will still keep the room type and neighbourhood features to help the model identify the extreme budget and luxury fluctuations.
try:
    listing_clean = listing_clean.drop(['latitude', 'longitude', 'price_group'], axis=1)
except KeyError:
    # columns already dropped on a previous run of this cell
    pass
listing_clean.head(3)
| host_is_superhost | accommodates | bedrooms | bathrooms | minimum_minimum_nights | calculated_host_listings_count | host_years_active | price_mean | private_bathroom | host_same_location | is_business | availability_90 | neighbourhood_cleansed | property_type | room_type | review_scores_location | review_scores_value | property_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| listing_id | ||||||||||||||||||
| 157714 | False | 3 | 1 | 1.5 | 4 | 1 | 12 | 70 | True | True | False | 67 | hammersmith_and_fulham | private_room_in_townhouse | private_room | 4.76 | 4.66 | 2 |
| 13913 | False | 1 | 1 | 1.0 | 1 | 2 | 14 | 79 | False | True | False | 78 | islington | private_room_in_rental_unit | private_room | 4.69 | 4.69 | 3 |
| 15400 | False | 2 | 1 | 1.0 | 7 | 1 | 14 | 107 | True | True | False | 2 | kensington_and_chelsea | entire_rental_unit | entire_home_apt | 4.93 | 4.74 | 4 |
# split the datasets into dependent and independent datasets
y = listing_clean['price_mean'].copy()
X = listing_clean.drop('price_mean', axis=1)
# we want different preprocessing steps for numeric vs categorical features
numerical_features = X.select_dtypes(include='number').columns.tolist()
print('Numerical features {}'.format(numerical_features))
categorical_features = X.select_dtypes(exclude='number').columns.tolist()
print('Categorical features {}'.format(categorical_features))
Numerical features ['accommodates', 'bedrooms', 'bathrooms', 'minimum_minimum_nights', 'calculated_host_listings_count', 'host_years_active', 'availability_90', 'review_scores_location', 'review_scores_value', 'property_cluster']
Categorical features ['host_is_superhost', 'private_bathroom', 'host_same_location', 'is_business', 'neighbourhood_cleansed', 'property_type', 'room_type']
As we will use cross-validation in the steps below, we want to avoid each fold ending up with different features when sparse columns are dropped after one-hot encoding. We therefore perform only the one-hot encoding step before modelling, using the full dataset, to ensure consistency. As we now have a property cluster, we will keep only the most frequent property types and group the exceptions to minimize interactions.
# cleaning data / preprocessing step before we split out data into train/test
onehot_pipe = Pipeline(steps=[
('one-hot', OneHotEncoder(dtype='int',
drop="if_binary",
handle_unknown='infrequent_if_exist',
sparse_output=False,
min_frequency=40))
])
onehot_preprocessor = ColumnTransformer(transformers=[
('category', onehot_pipe, categorical_features),
],
verbose_feature_names_out=False,
remainder = 'passthrough'
)
_ = onehot_preprocessor.fit(X).set_output(transform="pandas")
X_transformed = onehot_preprocessor.transform(X)
X_transformed.head()
| host_is_superhost_True | private_bathroom_True | host_same_location_True | is_business_True | neighbourhood_cleansed_barking_and_dagenham | neighbourhood_cleansed_barnet | neighbourhood_cleansed_bexley | neighbourhood_cleansed_brent | neighbourhood_cleansed_bromley | neighbourhood_cleansed_camden | neighbourhood_cleansed_city_of_london | neighbourhood_cleansed_croydon | neighbourhood_cleansed_ealing | neighbourhood_cleansed_enfield | neighbourhood_cleansed_greenwich | neighbourhood_cleansed_hackney | neighbourhood_cleansed_hammersmith_and_fulham | neighbourhood_cleansed_haringey | neighbourhood_cleansed_harrow | neighbourhood_cleansed_havering | neighbourhood_cleansed_hillingdon | neighbourhood_cleansed_hounslow | neighbourhood_cleansed_islington | neighbourhood_cleansed_kensington_and_chelsea | neighbourhood_cleansed_kingston_upon_thames | neighbourhood_cleansed_lambeth | neighbourhood_cleansed_lewisham | neighbourhood_cleansed_merton | neighbourhood_cleansed_newham | neighbourhood_cleansed_redbridge | neighbourhood_cleansed_richmond_upon_thames | neighbourhood_cleansed_southwark | neighbourhood_cleansed_sutton | neighbourhood_cleansed_tower_hamlets | neighbourhood_cleansed_waltham_forest | neighbourhood_cleansed_wandsworth | neighbourhood_cleansed_westminster | property_type_entire_bungalow | property_type_entire_condo | property_type_entire_cottage | property_type_entire_guest_suite | property_type_entire_guesthouse | property_type_entire_home | property_type_entire_loft | property_type_entire_place | property_type_entire_rental_unit | property_type_entire_serviced_apartment | property_type_entire_townhouse | property_type_entire_vacation_home | property_type_private_room | property_type_private_room_in_bed_and_breakfast | property_type_private_room_in_bungalow | property_type_private_room_in_casa_particular | property_type_private_room_in_condo | property_type_private_room_in_guest_suite | property_type_private_room_in_guesthouse | 
property_type_private_room_in_home | property_type_private_room_in_hostel | property_type_private_room_in_loft | property_type_private_room_in_rental_unit | property_type_private_room_in_serviced_apartment | property_type_private_room_in_townhouse | property_type_private_room_in_vacation_home | property_type_room_in_aparthotel | property_type_room_in_boutique_hotel | property_type_room_in_hotel | property_type_room_in_serviced_apartment | property_type_shared_room_in_home | property_type_shared_room_in_rental_unit | property_type_infrequent_sklearn | room_type_entire_home_apt | room_type_hotel_room | room_type_private_room | room_type_shared_room | accommodates | bedrooms | bathrooms | minimum_minimum_nights | calculated_host_listings_count | host_years_active | availability_90 | review_scores_location | review_scores_value | property_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| listing_id | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 157714 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1.5 | 4 | 1 | 12 | 67 | 4.76 | 4.66 | 2 |
| 13913 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1.0 | 1 | 2 | 14 | 78 | 4.69 | 4.69 | 3 |
| 15400 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 1 | 1.0 | 7 | 1 | 14 | 2 | 4.93 | 4.74 | 4 |
| 306333 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 2 | 1.0 | 2 | 1 | 11 | 4 | 4.89 | 4.53 | 8 |
| 159736 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 1.0 | 4 | 4 | 12 | 2 | 4.34 | 4.66 | 2 |
# make sure property types are well represented in both training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_transformed,
y,
test_size=0.25,
random_state=random_state,
stratify=X_transformed[['property_cluster', 'is_business_True']],
)
# pre-processing pipeline - scaling and impute data after splitting data into training and testing sets
numerical_preprocessor = Pipeline(steps=[
('impute', SimpleImputer(strategy='median')),
('scale', StandardScaler())
])
preprocessor = ColumnTransformer(transformers=[
('number', numerical_preprocessor, numerical_features),
],
verbose_feature_names_out=False,
remainder = 'passthrough'
)
Modelling Step 1 - Setting a baseline
Two baseline models will be created: one with a log transformation of the target to make its distribution normal, and one without.
As we observed many interactions in the data during EDA when introducing the features neighbourhood and room_type, we will use the Ridge algorithm with regularization to minimize their impact.
# As our base model, let's evaluate the performance of the Ridge estimator with a very small alpha, with and without target log transformation
alpha_base=1e-10
# without target transformation
base_model = Pipeline(
[
("preprocess", preprocessor),
('selector', VarianceThreshold(0)),
("base", Ridge(alpha_base)),
]
)
# with target transformation on log scale with base 10
base_model_with_log_target = Pipeline(
[
("preprocess", preprocessor),
('selector', VarianceThreshold(0)),
("base_transformed", TransformedTargetRegressor(regressor=Ridge(alpha_base),
func=log_transform,
inverse_func=inverse_log_transform)),
]
)
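The pipelines above reference `log_transform` and `inverse_log_transform`, helpers defined earlier in the notebook. A minimal sketch of what they do, assuming the base-10 log mentioned in the comment above:

```python
import numpy as np

def log_transform(y):
    """Transform the target to log10 scale to normalize its skewed distribution."""
    return np.log10(y)

def inverse_log_transform(y):
    """Map log10-scale predictions back to the original price scale."""
    return np.power(10, y)
```

TransformedTargetRegressor applies `func` to y before fitting and `inverse_func` to predictions, so evaluation metrics stay on the original price scale.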
_ = base_model.fit(X_train, y_train).set_output(transform="pandas")
_ = base_model_with_log_target.fit(X_train, y_train).set_output(transform="pandas")
Evaluation: base model¶
# predict
y_train_preds = base_model.predict(X_train)
y_train_preds_transformed = base_model_with_log_target.predict(X_train)
# transform X_train
X_train_transformed = preprocessor.transform(X_train)
X_train_transformed.head(3)
| accommodates | bedrooms | bathrooms | minimum_minimum_nights | calculated_host_listings_count | host_years_active | availability_90 | review_scores_location | review_scores_value | property_cluster | host_is_superhost_True | private_bathroom_True | host_same_location_True | is_business_True | neighbourhood_cleansed_barking_and_dagenham | neighbourhood_cleansed_barnet | neighbourhood_cleansed_bexley | neighbourhood_cleansed_brent | neighbourhood_cleansed_bromley | neighbourhood_cleansed_camden | neighbourhood_cleansed_city_of_london | neighbourhood_cleansed_croydon | neighbourhood_cleansed_ealing | neighbourhood_cleansed_enfield | neighbourhood_cleansed_greenwich | neighbourhood_cleansed_hackney | neighbourhood_cleansed_hammersmith_and_fulham | neighbourhood_cleansed_haringey | neighbourhood_cleansed_harrow | neighbourhood_cleansed_havering | neighbourhood_cleansed_hillingdon | neighbourhood_cleansed_hounslow | neighbourhood_cleansed_islington | neighbourhood_cleansed_kensington_and_chelsea | neighbourhood_cleansed_kingston_upon_thames | neighbourhood_cleansed_lambeth | neighbourhood_cleansed_lewisham | neighbourhood_cleansed_merton | neighbourhood_cleansed_newham | neighbourhood_cleansed_redbridge | neighbourhood_cleansed_richmond_upon_thames | neighbourhood_cleansed_southwark | neighbourhood_cleansed_sutton | neighbourhood_cleansed_tower_hamlets | neighbourhood_cleansed_waltham_forest | neighbourhood_cleansed_wandsworth | neighbourhood_cleansed_westminster | property_type_entire_bungalow | property_type_entire_condo | property_type_entire_cottage | property_type_entire_guest_suite | property_type_entire_guesthouse | property_type_entire_home | property_type_entire_loft | property_type_entire_place | property_type_entire_rental_unit | property_type_entire_serviced_apartment | property_type_entire_townhouse | property_type_entire_vacation_home | property_type_private_room | property_type_private_room_in_bed_and_breakfast | property_type_private_room_in_bungalow | 
property_type_private_room_in_casa_particular | property_type_private_room_in_condo | property_type_private_room_in_guest_suite | property_type_private_room_in_guesthouse | property_type_private_room_in_home | property_type_private_room_in_hostel | property_type_private_room_in_loft | property_type_private_room_in_rental_unit | property_type_private_room_in_serviced_apartment | property_type_private_room_in_townhouse | property_type_private_room_in_vacation_home | property_type_room_in_aparthotel | property_type_room_in_boutique_hotel | property_type_room_in_hotel | property_type_room_in_serviced_apartment | property_type_shared_room_in_home | property_type_shared_room_in_rental_unit | property_type_infrequent_sklearn | room_type_entire_home_apt | room_type_hotel_room | room_type_private_room | room_type_shared_room | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| listing_id | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 899815776871362256 | 0.553869 | 0.704052 | 1.313731 | -0.256894 | -0.302636 | 0.851341 | 0.557023 | 0.237322 | 0.224832 | 0.295509 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 50224111 | 0.553869 | -0.517296 | -0.568225 | -0.367192 | -0.302636 | 0.237242 | 1.413330 | 0.673890 | -0.221122 | 0.088514 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46602756 | -0.632473 | 0.704052 | -0.568225 | 0.074002 | -0.302636 | -0.990957 | -1.060445 | 0.482891 | -0.836964 | 0.088514 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Mutual Information (MI) - Feature Importance¶
Mutual Information (MI) measures how much knowing a feature tells us about the target variable, in this case 'price'. Note: Mutual Information only works with numerical data.
A higher MI value means a stronger dependency between the feature and the target, indicating the feature is important for training the model, whereas an MI score close to 0 indicates a weak connection between the feature and the target.
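`get_mi_score` is a helper defined earlier in the notebook; a minimal sketch of what it might look like, built on sklearn's `mutual_info_regression` (the exact implementation may differ):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def get_mi_score(X, y, random_state=0):
    """Mutual information of each feature with the target, sorted descending."""
    mi = mutual_info_regression(X, y, random_state=random_state)
    return pd.Series(mi, index=X.columns, name='mi_score').sort_values(ascending=False)

# tiny demo: a feature that drives the target scores much higher than pure noise
rng = np.random.default_rng(0)
demo = pd.DataFrame({'signal': rng.normal(size=500), 'noise': rng.normal(size=500)})
target = 3 * demo['signal'] + rng.normal(scale=0.1, size=500)
scores = get_mi_score(demo, target)
```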
# display Mutual Information scores of the features
mi_score = get_mi_score(X_train_transformed, y_train)
# Show the 25 most important features
topn = 25
plt.figure(figsize=(6, 8))
ax = sns.barplot(y=mi_score.index[:topn], x=mi_score[:topn])
plt.xlabel('MI score')
ax.set_title('Mutual Information (MI) scores')
plt.show()
Test for multicollinearity using Variance Inflation Factor (VIF)¶
Multicollinearity means there are linear relationships or correlations between the independent variables, which should not be the case for linear regression to be effective.
Typically, a variable with a VIF greater than 5 indicates multicollinearity.
# gather features
vif_features = " + ".join(X_train_transformed.columns)
# merge X and y as required by statsmodels
vif_df = X_train_transformed.merge(y_train, left_index=True, right_index=True)
sm_y, sm_X = dmatrices('price_mean ~ ' + vif_features, vif_df, return_type='dataframe')
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(sm_X.values, i) for i in range(sm_X.shape[1])]
vif['features'] = sm_X.columns
vif
Observation:
- A VIF of 'inf' means R² is exactly 1. Since the formula is VIF = 1/(1 - R²), we end up dividing by 0, producing 'inf' values. This indicates severe multicollinearity.
- Many interactions between neighbourhood and room type exist; ideally we should drop one of these variables. But as we are using the Ridge algorithm, let's observe how regularization can take care of this problem.
- As accommodates, bedrooms and bathrooms have VIF below 5, I assume it's OK to keep all three variables, even though they have some correlation.
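A quick numeric illustration of the VIF formula, where R² comes from regressing one feature on all the others:

```python
def vif_from_r2(r2):
    """VIF = 1 / (1 - R^2): it explodes as a feature becomes
    perfectly predictable from the other features."""
    return float('inf') if r2 == 1 else 1 / (1 - r2)

# R^2 of 0.8 gives VIF ~5 (the usual warning threshold); R^2 of 1 gives inf
print(vif_from_r2(0.5), vif_from_r2(0.8), vif_from_r2(1))
```

This is why the one-hot encoded dummies produce 'inf': within a group like room_type the dummies sum to 1, so each is perfectly predictable from the rest.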
Plot model performance¶
plot_model_performance_scatter(
y_train,
y_train_preds,
y_train_preds_transformed,
title1='WITHOUT target transformation',
title2='WITH target transformation',
suptitle='BASE MODEL Performance - Ridge regression (alpha = {})'.format(alpha_base),
kind='actual_vs_predicted',
lower_lim=-100,
upper_lim=1500,
interval=100,
)
# Let's zoom in a bit to the most likely price range, filtering out the luxury listings
plot_model_performance_scatter(
y_train,
y_train_preds,
y_train_preds_transformed,
title1='WITHOUT target transformation',
title2='WITH target transformation',
suptitle='BASE MODEL Performance ZOOMED - Ridge regression (alpha = {})'.format(alpha_base),
kind='actual_vs_predicted',
lower_lim=-25,
upper_lim=300,
interval=25,
)
Observation:
- The model without log transformation frequently under-predicts and even produces negative price predictions, yet its overall scores are better, which is quite strange!
- The model with log transformation does very well for the most likely prices between £100 and £500, but over-predicts outlier budget listings below £75 and luxury listings over £500, which gives it poorer overall scores.
plot_model_performance_hist(
y_train,
y_train_preds,
y_train_preds_transformed,
title1='WITHOUT target transformation',
title2='WITH target transformation',
suptitle='BASE MODEL Actual vs Predicted - Ridge regression (alpha = {})'.format(alpha_base),
)
Observation: The log-transformed model reproduces the exponential shape of the price distribution better. Overall the model is doing well, but it overfits on budget listings below £75.
Linear Assumption Test: Normal distribution of residuals¶
Residuals should be normally distributed, with a mean around 0.
distribution_of_errors(
model1=base_model,
model2=base_model_with_log_target,
X=X_train,
y=y_train,
kind='kde',
title1='WITHOUT target transformation',
title2='WITH target transformation',
suptitle='Distribution of residuals - Ridge regression with small regularization (alpha = {})'.format(alpha_base),
)
distribution_of_errors(
model1=base_model,
model2=base_model_with_log_target,
X=X_train,
y=y_train,
kind='hist',
title1='WITHOUT target transformation',
title2='WITH target transformation',
suptitle='Distribution of residuals - Ridge regression (alpha = {})'.format(alpha_base),
)
Observation: The log-transformed model shows a higher density around mean = 0, which is great; however, its error range between -250 and 250 is wider.
Linear Assumption Test: Homoscedasticity¶
Residuals should have a constant variance, with a mean around 0
# Assumption #3: Homoscedasticity - residuals have a constant variance
plot_model_performance_scatter(
y_train,
y_train_preds,
y_train_preds_transformed,
title1='WITHOUT target transformation',
title2='WITH target transformation',
suptitle='Linear Assumption Test: Homoscedasticity (alpha = {})'.format(alpha_base),
kind='residual_vs_predicted',
lower_lim=0,
upper_lim=2000,
interval=100,
)
residual_variance(
model1=base_model,
model2=base_model_with_log_target,
X=X_train,
y=y_train,
title1='WITHOUT target transformation',
title2='WITH target transformation',
suptitle='Homoscedasticity of Error Terms (alpha = {})'.format(alpha_base),
figsize=(14,4),
)
Observation: Both models have constant variance for most listings, though both show some outliers. The variance is larger for the log-transformed model.
Inspect Model Coefficients¶
coef = plot_coef(
model1=base_model,
model2=base_model_with_log_target,
title1='WITHOUT target transformation',
title2='WITH target transformation',
suptitle='Model coeffients - Ridge regression (alpha = {})'.format(alpha_base),
figsize=(12,20),
logscale1=False,
logscale2=True,
)
The plot above tells us about dependencies between a specific feature and the target when all other features remain constant, i.e., conditional dependencies.
Without log-scale conversion, the coefficients represent the actual increase or decrease in price in pounds.
On the log scale, the coefficients represent the percentage increase or decrease in price:
- For example, 1.22 means a 22% increase in price for every 1-unit increase in that variable if all other variables stay constant.
- For example, 0.8 means a 20% decrease in price for every 1-unit increase in that variable if all other variables stay constant.
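The multiplier reading above can be sketched as a one-liner (the numbers here are illustrative, not taken from the fitted model):

```python
def multiplier_to_pct_change(multiplier):
    """Convert a back-transformed log-scale coefficient (a price
    multiplier) into a percentage change in price."""
    return (multiplier - 1) * 100

# 1.22 -> +22% per unit increase; 0.8 -> -20% per unit increase
print(multiplier_to_pct_change(1.22), multiplier_to_pct_change(0.8))
```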
Warning
The coefficients for the model without log transformation make little sense. The intercept is high, at around £200 per night. An increase in ENTIRE HOMES induces a decrease in PRICE when all other features remain constant; on the contrary, an increase in SHARED ROOMS induces an increase in PRICE when all other features remain constant. This contradicts what we saw in the exploratory analysis above.
The coefficients for the model with log transformation look better. The intercept seems reasonable at around £115. There are some strange coefficient signs here as well; for example, shared room in home would be expected to have a coefficient value below 1.
Variability of the coefficients¶
We can check the coefficient variability through cross-validation: it is a form of data perturbation (related to resampling).
If coefficients vary significantly when changing the input dataset, their robustness is not guaranteed, and they should probably be interpreted with caution.
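`coefficient_variability` is a helper defined earlier in the notebook; the core idea can be sketched with sklearn's `cross_validate` and `RepeatedKFold` on synthetic data (names and plotting details here are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_validate

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(5)])

# refit the model on many resampled folds, keeping each fitted estimator
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
cv_results = cross_validate(Ridge(alpha=1.0), X, y, cv=cv, return_estimator=True)

# one row of coefficients per fitted fold; the spread shows coefficient stability
coefs = pd.DataFrame([est.coef_ for est in cv_results['estimator']], columns=X.columns)
print(coefs.std())  # low std -> robust coefficients
```

The notebook's helper additionally plots the per-fold coefficients as boxplots and returns the cross-validation scores.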
cv_results_base_log = coefficient_variability(
model=base_model_with_log_target,
X=X_train,
y=y_train,
splits=10,
repeats=10,
jobs=-1,
scoring=('r2', 'neg_mean_squared_error'),
figsize=(12,18),
title='WITH target transformation',
suptitle='BASE MODEL Coefficient variability - Ridge model (alpha = {})'.format(alpha_base),
random_state=random_state
)
# Print average scores during cross validation
print('Avg MSE score: {} with std {}'.format(cv_results_base_log['test_neg_mean_squared_error'].mean(), cv_results_base_log['test_neg_mean_squared_error'].std()))
print('Avg R2: {} with std {}'.format(cv_results_base_log['test_r2'].mean(), cv_results_base_log['test_r2'].std()))
Observation: We observe high variability in room_type, and some minor variability in a few neighbourhoods.
QQ plot (short for Quantile-Quantile)¶
As the name suggests, it depicts the quantiles of the observed distribution (residuals in this case) against the quantiles of a reference distribution, typically the standard normal distribution.
- We plot the quantiles of the two distributions against each other.
- Deviations from the straight line indicate a departure from the assumed distribution.
A good QQ plot will:
- show minimal deviations from the reference line, indicating that the residuals are approximately normally distributed.
A bad QQ plot will:
- exhibit significant deviations, indicating a departure from normality of the residuals;
- display patterns of skewness with diverging ends, etc.
Thus, the more aligned the QQ plot looks, the more confident you can be about your model.
This is especially useful when the regression line is difficult to visualize, i.e., in a high-dimensional dataset.
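A QQ plot's inputs can be computed directly with `scipy.stats.probplot`; a minimal sketch on synthetic residuals (the notebook's `compare_qqplot` helper wraps something similar):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(loc=0, scale=25, size=500)  # stand-in for y - y_pred

# quantiles of the residuals vs quantiles of a standard normal
(theoretical_q, ordered_residuals), (slope, intercept, r) = stats.probplot(residuals, dist='norm')

# r close to 1 means the points hug the reference line, i.e. near-normal residuals
print(round(r, 3))
```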
#QQ plot
compare_qqplot(
base_model,
base_model_with_log_target,
X_train,
y_train,
title1='WITHOUT target transformation',
title2='WITH target transformation',
suptitle='BASE model QQ plot - Ridge regression',
figsize=(12,4)
)
Modelling Step 2 - Model improvement with RidgeCV
from sklearn.model_selection import KFold
cv = KFold(n_splits=10, shuffle=True, random_state=random_state)
# manually set alpha quite high to see the effect.
alphas = (3000, 4000, 5000)
ridgecv = Pipeline(
[
("preprocess", preprocessor),
('selector', VarianceThreshold(0)),
("regressor", TransformedTargetRegressor(regressor=RidgeCV(alphas=alphas,
scoring='neg_mean_squared_error',
cv=cv),
func=log_transform,
inverse_func=inverse_log_transform)),
]
)
ridgecv.fit(X_train, y_train).set_output(transform="pandas")
First we check which value of $\alpha$ has been selected.
alpha_ridgecv = ridgecv.named_steps.regressor.regressor_.alpha_
print('alpha chosen: {}'.format(alpha_ridgecv))
# Best R2 score during fitting
print('Best R2 score during fitting: {}'.format(ridgecv.named_steps.regressor.regressor_.best_score_))
Evaluation: RidgeCV¶
def model_evaluation(model1, model2, X, y, alpha1=None, alpha2=None, title1=None, title2=None, suptitle=None):
title1 = title1 + ' (alpha = {})'.format(alpha1)
title2 = title2 + ' (alpha = {})'.format(alpha2)
# prediction
y_pred_model1 = model1.predict(X)
y_pred_model2 = model2.predict(X)
# model performance scatter plot: actual vs predicted
plot_model_performance_scatter(
y,
y_pred_model1,
y_pred_model2,
title1=title1,
title2=title2,
suptitle=suptitle + ' Model performance - Actual vs Predicted',
kind='actual_vs_predicted',
lower_lim=-100,
upper_lim=2000,
interval=100,
figsize=(12,6)
)
# model performance scatter plot ZOOMED: actual vs predicted
plot_model_performance_scatter(
y,
y_pred_model1,
y_pred_model2,
title1=title1,
title2=title2,
suptitle=suptitle + ' Model performance - Actual vs Predicted',
kind='actual_vs_predicted',
lower_lim=0,
upper_lim=300,
interval=25,
figsize=(12,6)
)
# model performance histogram plot: actual vs predicted
plot_model_performance_hist(
y,
y_pred_model1,
y_pred_model2,
title1=title1,
title2=title2,
suptitle=suptitle + ' Model performance - Actual vs Predicted',
price_max=1500,
price_interval=30,
figsize=(12,12)
)
# Linear Assumption Test: Distribution of Residuals'
distribution_of_errors(
model1=model1,
model2=model2,
X=X,
y=y,
kind='kde',
title1=title1,
title2=title2,
suptitle=suptitle + ' Linear Assumption Test: Distribution of Residuals',
figsize=(12,6)
)
distribution_of_errors(
model1=model1,
model2=model2,
X=X,
y=y,
kind='hist',
title1=title1,
title2=title2,
suptitle=suptitle + ' Linear Assumption Test: Distribution of Residuals',
figsize=(12,6)
)
# Linear Assumption Test: Homoscedasticity - residuals should have a constant variance with mean = 0
plot_model_performance_scatter(
y,
y_pred_model1,
y_pred_model2,
title1=title1,
title2=title2,
suptitle=suptitle + ' Linear Assumption Test: Homoscedasticity of Error Terms',
kind='residual_vs_predicted',
lower_lim=0,
upper_lim=1500,
interval=100,
figsize=(12,6)
)
residual_variance(
model1=model1,
model2=model2,
X=X,
y=y,
title1=title1,
title2=title2,
suptitle=suptitle + ' Linear Assumption Test: Homoscedasticity of Error Terms',
figsize=(14,4),
)
model_evaluation(
model1=ridgecv,
model2=base_model_with_log_target,
X=X_train,
y=y_train,
alpha1=alpha_ridgecv,
alpha2=alpha_base,
title1='RidgeCV',
title2='Base Model - Ridge with log transformed target',
suptitle='TRAIN Airbnb {}: '.format(date_stamp)
)
Inspect Model Coefficients¶
# on log scale, %
coef = plot_coef(
model1=ridgecv,
model2=base_model_with_log_target,
title1='RidgeCV' + ' (alpha = {})'.format(alpha_ridgecv),
title2='Base Model - Ridge with log transformed target' + ' (alpha = {})'.format(alpha_base),
suptitle='Model Coefficents Importance',
logscale1=True,
logscale2=True,
figsize=(12,20),
)
Observations:
On the log scale, the coefficients are multipliers, representing the % increase in price if all other variables stay constant. For example, private_bathroom_True with a value of 1.082484 means a private bathroom increases the price by about 8.2% if all other variables stay constant.
Explanation of how to interpret the coefficients, from a Udacity mentor (I hope I got it right!):
Intercept on Original Scale: The intercept value in your regression model is 2.12. When this is converted back to the original scale of the average price, it becomes approximately 130.83. This implies that if all other predictor variables are set to zero, the base predicted average price per day would be around £130.83.
Multiplier for accommodates Coefficient: The coefficient for the variable accommodates in your model is 0.0876. On the logarithmic scale, this coefficient signifies that for each unit increase in the accommodates variable, the log-transformed price is expected to increase by 0.0876. When this is translated back to the original price scale, it means that for each additional person accommodated, the predicted average price is multiplied by approximately 1.22. So, if accommodating one additional person, the price increases by about 22%.
I therefore interpret that these coefficients, when converted back, represent a percentage change.
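We can sanity-check the mentor's numbers by back-transforming from the log10 scale:

```python
# intercept 2.12 on log10 scale back-transforms to roughly £131.8 per night
# (the quoted £130.83 corresponds to an unrounded intercept of about 2.1167)
base_price = 10 ** 2.12

# accommodates coefficient 0.0876: each extra guest multiplies price by ~1.22, i.e. +22%
guest_multiplier = 10 ** 0.0876

print(round(base_price, 2), round(guest_multiplier, 2))
```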
Variability of the coefficients¶
cv_results = coefficient_variability(
model=ridgecv,
X=X_train,
y=y_train,
splits=10,
repeats=10,
jobs=-1,
scoring=('r2', 'neg_mean_squared_error'),
figsize=(12,18),
title='RidgeCV' + ' (alpha = {})'.format(alpha_ridgecv),
suptitle='Coefficient importance and its variability',
random_state=random_state
)
# Print average scores during cross validation
print('Avg MSE score: {} with std {}'.format(cv_results['test_neg_mean_squared_error'].mean(), cv_results['test_neg_mean_squared_error'].std()))
print('Avg R2: {} with std {}'.format(cv_results['test_r2'].mean(), cv_results['test_r2'].std()))
QQ plot (short for Quantile-Quantile)¶
#QQ plot
compare_qqplot(
ridgecv,
base_model_with_log_target,
X_train,
y_train,
title1='RidgeCV' + ' (alpha = {})'.format(alpha_ridgecv),
title2='Base Model - Ridge with log transformed target' + ' (alpha = {})'.format(alpha_base),
suptitle='QQ plot',
figsize=(12,4)
)
Final evaluation using TEST data
model_evaluation(
model1=ridgecv,
model2=base_model_with_log_target,
X=X_test,
y=y_test,
alpha1=alpha_ridgecv,
alpha2=alpha_base,
title1='RidgeCV',
title2='Base Model - Ridge with log transformed target',
suptitle='TEST Airbnb {}: '.format(date_stamp)
)
Further evaluation with Airbnb 06/09/2023 data (optional)
As an extra step, we use the latest Airbnb data released on 06 September 2023 to test model performance. This data was never seen during development of the model, so we can observe how it generalizes on a brand-new dataset! We will use the model already fitted on the June 2023 data.
Notebooks 1 and 2 were run manually in sequence, changing the date stamp parameter from 060623 to 090623.
A new enhanced file was generated with suffix '_090623' in the data/clean folder.
# read latest airbnb listings file
listing_latest = pd.read_pickle(FILE_NAME_ENHANCED_LATEST)
listing_latest.head(2)
# for easy troubleshooting, lets make listing_id the index
listing_latest = listing_latest.set_index('listing_id')
listing_latest = listing_latest.loc[:, main_cols]
listing_latest.head(2)
# cluster, clean and split data
X_train_060923, X_test_060923, y_train_060923, y_test_060923 = cleaning(listing_latest, cluster_preprocessor, gmm, onehot_preprocessor)
# evaluate latest airbnb file using models trained with June 2023 data
model_evaluation(
model1=ridgecv,
model2=base_model_with_log_target,
X=X_train_060923,
y=y_train_060923,
alpha1=alpha_ridgecv,
alpha2=alpha_base,
title1='RidgeCV',
title2='Base Model - Ridge with log transformed target',
suptitle='TEST Airbnb {}: '.format(date_stamp_latest)
)
# save fitted models
dump(cluster_preprocessor, open('../data/models/cluster_preprocessor.pkl', 'wb'))
dump(gmm, open('../data/models/gmm.pkl', 'wb'))
dump(onehot_preprocessor, open('../data/models/onehot_preprocessor.pkl', 'wb'))
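The fitted artefacts above are written with `dump`; assuming this is the standard library `pickle` module, they can be restored later with `pickle.load`. A minimal self-contained roundtrip, using a stand-in object rather than the real fitted models:

```python
import os
import tempfile
from pickle import dump, load

# stand-in for a fitted preprocessor/model
artefact = {'name': 'onehot_preprocessor', 'fitted': True}

# save to a temporary file, mirroring the dump(...) calls above
path = os.path.join(tempfile.mkdtemp(), 'artefact.pkl')
with open(path, 'wb') as f:
    dump(artefact, f)

# restore the object in a later session
with open(path, 'rb') as f:
    restored = load(f)

print(restored == artefact)  # True
```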
Conclusions
Linear regression modelling was selected because the target variable, average rental price per night, is a continuous variable. The target variable was converted to a log10 scale so that it is normally distributed, meeting the linear regression assumptions.
A successful linear regression model using the RidgeCV algorithm was built to predict prices with a median absolute error of only £11 per night. This means prices are accurately predicted within a ±£11 range. The R² score on the test dataset was 0.976, which means the model explains about 98% of the variability in the data, which is totally amazing. RidgeCV was chosen to minimize the interactions introduced when adding the features neighbourhood, room type and property type. A high level of regularization, with alpha at 3000, was needed to obtain this low level of error.
To reduce model complexity, property type was summarized through clustering. There are over 80 property types, which could make the model too complex after one-hot encoding and cause overfitting. We could use room type instead, but with only 4 categories and so much price variation within each, it is too simple for modelling. Room type, property type, location coordinates (latitude, longitude) and price were used to cluster the properties into price groups.
The Gaussian mixture model (GMM) algorithm was used for clustering, as it looks for clusters with a normal price distribution within each cluster, which provides a better split.
The average rental price in London does not easily exceed £500; however, outliers up to £1500 were kept to observe model behaviour. These outliers can be removed subsequently for further improvement. Generally, model predictions up to £500 look excellent when observing actual vs predicted performance; residuals are nearly linear up to about ±£250.
As a last step, the model was evaluated on a new dataset with time stamp 06/09/2023, which became available subsequently. Outliers were not removed as they were during modelling, so the model had to cope with exceptional prices below £30 and above £1500. The mean absolute error was a little higher at £16 and the R² score was 0.956, still a great result considering the outliers were not removed.
References
- Common pitfalls in the interpretation of coefficients of linear models
- Mixed-type imputation for IterativeImputer
- kaggle feature engineering lesson
- boston house prices
- linear regression assumptions
- Regularization of linear regression model
- pipeline for multiple predictors
- when to use gamma GLMS
- feature engineering - clustering with K-means
- using FACETGRID to plot geopandas maps
- QQ plots
- Interpreting skew and kurtosis
Submission
# convert notebook to html
os.system('jupyter nbconvert --to html 4_Price_Prediction_latest.ipynb')